AI bioinformatics: avoiding command line hallucinations


In two previous articles we found that LLMs are very good at explaining complex bioinformatics command lines, but not good at identifying incorrect or meaningless arguments.

Now for the natural follow-up experiment: what tweaks can we make to the settings or prompts to increase the chances of catching such errors? We will try a few different approaches, and test each one with the same three command lines whose invented arguments went unflagged previously. As a reminder, they are:

  • a bwa mem command with a made-up --parallel option
  • a bwa mem command with a made-up --skip-errors option
  • a blastn command with a made-up --precise option

All experiments here are done with GPT5.2.

Turning on thinking mode

We will start with a very easy tweak and turn on thinking mode in the model options. This causes the model to output more reasoning tokens before responding.
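
For anyone running these checks through an API rather than a chat interface, the equivalent of this toggle is usually a reasoning or thinking parameter on the request. Below is a minimal sketch using the OpenAI Python client; the model name and the exact shape of the reasoning option are assumptions on my part and will differ between providers and versions.

    # Minimal sketch: ask for an explanation with extended reasoning enabled.
    # Assumes the OpenAI Python client; the model name and the shape of the
    # "reasoning" option are assumptions and vary by provider and version.
    from openai import OpenAI

    client = OpenAI()  # reads OPENAI_API_KEY from the environment

    response = client.responses.create(
        model="gpt-5",                 # placeholder model name
        reasoning={"effort": "high"},  # the "thinking mode" toggle
        input=(
            "Explain the options and arguments in this command line:\n\n"
            "blastn -query queries.fasta -db nt -remote --precise "
            '-outfmt "6 qseqid sseqid staxids pident length evalue bitscore" '
            "-max_target_seqs 5 -evalue 1e-6 -num_threads 8 -out results.tsv"
        ),
    )
    print(response.output_text)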

This simple tweak is surprisingly effective. It causes the model to correctly identify the problem with the first bwa command and flag it directly:

  • Important: standard bwa mem does not have a --parallel option.

For the second bwa command, thinking mode makes no difference - the model still fails to recognise that --skip-errors isn’t a valid option for bwa mem.

For the blastn command, thinking mode gives a partial success: the model still makes up an explanation, but we get the sense that it might not be exactly right:

  • --precise Requests a more precise/strict search mode.

    • Practical effect: typically aims to improve exactness/consistency of results (often at a speed cost).
    • Note: --precise is not a traditional “classic BLAST+” option in many installations; if your blastn errors with “unrecognized option”, it may be coming from a specific wrapper/build/version. If it works for you, keep it; if not, remove it.
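
The model’s hedged note points at the right follow-up: the tool’s own help text is the ground truth. On a machine with BLAST+ installed, a quick check along the lines of the sketch below settles the question (the exact wording of the help output varies between BLAST+ versions).

    # Minimal sketch: look for the questionable flag in blastn's own help text.
    # Assumes BLAST+ is installed and blastn is on PATH.
    import subprocess

    help_text = subprocess.run(
        ["blastn", "-help"], capture_output=True, text=True
    ).stdout

    if "precise" in help_text:
        print("blastn -help mentions something matching 'precise'")
    else:
        print("'precise' does not appear in blastn -help on this system")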


Making errors more prominent

Something that has emerged from this set of experiments is that there are really two questions we are interested in: does the model identify the incorrect argument, and does it flag it up in a human-readable way? Once we have thinking and web search turned on, the model is doing a pretty good job at the former. However, the information about the incorrect argument is not given any special place in the output, so it would be easy for a human reader to miss.

Let’s try modifying the prompt to explicitly check for incorrect arguments:

Explain the options and arguments in this command line:

blastn -query queries.fasta -db nt -remote --precise -outfmt "6 qseqid sseqid staxids pident length evalue bitscore" -max_target_seqs 5 -evalue 1e-6 -num_threads 8 -out results.tsv

include details on the roles of each file name. If any of the arguments are incorrect or invalid, report that first and then stop.

Given these explicit instructions, the model surfaces the error much better:

❗ Invalid Option

Before explaining any of the arguments, there is one invalid option in the command:

  • --precise — This is not a valid BLAST+ option in blastn. …

🧠 Key Takeaway

Everything in the command is valid except --precise, which should not be included. Remove that flag and the command should run (given access to the database and connectivity to the remote server). ([NCBI][1])

The error is also given much more prominence in the response when we ask about the bwa mem command with the --skip-errors option:

The option --skip-errors is not a valid BWA-MEM option. I could not find any evidence that bwa mem supports --skip-errors in its documented option list — all standard options use a single dash (e.g., -t, -R, -M) and there is no built-in --skip-errors documented. ([Debian Manpages][1])

Invalid Argument

  • --skip-errors is not a recognized BWA-MEM option. — You must remove it for the command to run.
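
If you want to run this check on other command lines, the modified prompt is easy to template. The sketch below wraps it in a hypothetical helper; the function name, model name and client usage are my own choices rather than part of the experiment above.

    # Hypothetical helper that wraps the "report invalid arguments first" prompt.
    # Assumes the OpenAI Python client; the model name is a placeholder.
    from openai import OpenAI

    client = OpenAI()

    PROMPT_TEMPLATE = (
        "Explain the options and arguments in this command line:\n\n"
        "{command}\n\n"
        "include details on the roles of each file name. If any of the arguments "
        "are incorrect or invalid, report that first and then stop."
    )

    def explain_command(command: str) -> str:
        response = client.responses.create(
            model="gpt-5",                 # placeholder model name
            reasoning={"effort": "high"},  # keep thinking mode switched on
            input=PROMPT_TEMPLATE.format(command=command),
        )
        return response.output_text

    # example usage with any command line you want checked
    print(explain_command("bwa mem -t 8 reference.fa reads_1.fq reads_2.fq"))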

Summary

When using LLMs for the task of explaining complex command lines - something that they generally excel at! - we can increase the chance of catching errors by turning on both thinking and web search. We can also increase the prominence of error reporting by tweaking the prompt to make it explicit.

These are probably best practices for using AI in this way, especially since this set of examples used two very widely discussed tools. More obscure tools are less likely to be well represented in the training set, and so errors in their command lines would be even harder to catch.

If you are interested in doing your own experiments with web search, remember to use example commands different from mine, as this set of articles will show up easily in web searches if you use the same incorrect arguments!

One thing that we haven’t tried so far is to use one of the agentic models that can run the command line and load the output into context. That will be an interesting experiment for the future, as it won’t rely either on knowledge directly embedded in the model or on context available from a web search.
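
In the meantime, a rough approximation of that idea can be scripted by hand: run the command yourself, capture whatever it prints, and pass that back to the model along with the original question, so the answer no longer depends solely on training data or search results. A sketch is below; the example command, model name and client usage are all my own, and blindly executing untrusted command lines like this would of course need sandboxing in any real setup.

    # Rough sketch of the "let the tool speak for itself" idea: run the command,
    # capture its output and error text, and include them in the prompt.
    # The command, model name and client usage are illustrative assumptions.
    import subprocess
    from openai import OpenAI

    client = OpenAI()

    command = "bwa mem --skip-errors reference.fa reads.fq"  # hypothetical example

    result = subprocess.run(
        command, shell=True, capture_output=True, text=True, timeout=60
    )

    prompt = (
        "Explain the options and arguments in this command line, and flag any "
        f"that are invalid:\n\n{command}\n\n"
        f"Running it returned exit code {result.returncode} with this output:\n\n"
        f"{(result.stdout + result.stderr)[:2000]}"
    )

    response = client.responses.create(model="gpt-5", input=prompt)
    print(response.output_text)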